1 Geodemographics

This week we will see how we can use socio-demographic and socio-economic data to characterise neighbourhoods using geodemographics. Geodemographics is the “analysis of people by where they live” (Harris et al. 2005) and as such “entails representing the individual and collective identities that are manifest in observable neighbourhood structure” (Longley 2012). We will look at geodemographics by focusing on an existing geodemographic classification known as the Internet User Classification.

1.1 Reading list

Essential readings

  • Longley, P. A. 2012. Geodemographics and the practices of geographic information science. International Journal of Geographical Information Science 26(12): 2227-2237. [Link]
  • Martin, D., Gale, C., Cockings, S. et al. 2018. Origin-destination geodemographics for analysis of travel to work flows. Computers, Environment and Urban Systems 67: 68-79. [Link]
  • Singleton, A., Alexiou, A. and Savani, R. 2020. Mapping the geodemographics of digital inequality in Great Britain: An integration of machine learning into small area estimation. Computers, Environment and Urban Systems 82: 101486. [Link]
  • Singleton, A. and Spielman, S. 2014. The past, present, and future of geodemographic research in the United States and United Kingdom. The Professional Geographer 66(4): 558-567. [Link]

Suggested readings

  • Goodman, A., Wilkinson, P., Stafford, M. et al. 2011. Characterising socio-economic inequalities in exposure to air pollution: A comparison of socio-economic markers and scales of measurement. Health & Place 17(3): 767-774. [Link]

1.2 Geodemographics

The CDRC Internet User Classification (IUC) is a bespoke geodemographic classification that describes how people residing in different parts of Great Britain interact with the Internet. For every Lower Super Output Area (LSOA) in England and Wales and every Data Zone (DZ) in Scotland (2011 Census geographies), the IUC provides aggregate population estimates of Internet use (Singleton et al. 2020) and insights into the geography of the digital divide in the United Kingdom.

“Digital inequality is observable where access to online resources and those opportunities that these create are non-egalitarian. As a result of variable rates of access and use of the Internet between social and spatial groups (..), this leads to digital differentiation, which entrenches difference and reciprocates digital inequality over time” (Singleton et al. 2020).

1.2.1 Internet User Classification I

For the first part of this week’s practical material, we will be looking at the Internet User Classification (IUC) for Great Britain in more detail by mapping it.

Our first step is to download the IUC data set:

  • Open a web browser and go to the data portal of the CDRC.
  • Register if you need to, or if you are already registered, make sure you are logged in.
  • Search for Internet User Classification.
  • Scroll down and choose the download option for the IUC 2018 (CSV).
  • Save the iuc_gb_2018.csv file in an appropriate folder.

Figure 1.1: Download the GB IUC 2018.

Start by inspecting the data set in MS Excel, or any other spreadsheet software such as Apache OpenOffice Calc or Numbers. Also, have a look at the IUC 2018 User Guide that provides the pen portraits of every cluster, including plots of cluster centres and a brief summary of the methodology.

Note
It is always a good idea to inspect your data prior to analysis to find out what your data look like. Of course, depending on the type of data, you can choose any tool you like to do this inspection (ArcGIS, R, QGIS, Microsoft Excel, etc.).

Figure 1.2: GB IUC 2018 in Excel.

# load libraries
library(tidyverse)
library(tmap)

# load data
iuc <- read_csv("data/index/iuc_gb_2018.csv")

# inspect
iuc
## # A tibble: 41,729 × 5
##    SHP_ID LSOA11_CD LSOA11_NM              GRP_CD GRP_LABEL                    
##     <dbl> <chr>     <chr>                   <dbl> <chr>                        
##  1      1 E01020179 South Hams 012C             5 e-Rational Utilitarians      
##  2      2 E01033289 Cornwall 007E               9 Settled Offline Communities  
##  3      3 W01000189 Conwy 015F                  5 e-Rational Utilitarians      
##  4      4 W01001022 Bridgend 014B               7 Passive and Uncommitted Users
##  5      5 W01000532 Ceredigion 007B             9 Settled Offline Communities  
##  6      6 E01018888 Cornwall 071G               9 Settled Offline Communities  
##  7      7 E01018766 Cornwall 028D               9 Settled Offline Communities  
##  8      8 E01019948 East Devon 010C             9 Settled Offline Communities  
##  9      9 W01000539 Ceredigion 005D             5 e-Rational Utilitarians      
## 10     10 E01019171 Barrow-in-Furness 005E      6 e-Mainstream                 
## # … with 41,719 more rows
# inspect data types
str(iuc)
## spec_tbl_df [41,729 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ SHP_ID   : num [1:41729] 1 2 3 4 5 6 7 8 9 10 ...
##  $ LSOA11_CD: chr [1:41729] "E01020179" "E01033289" "W01000189" "W01001022" ...
##  $ LSOA11_NM: chr [1:41729] "South Hams 012C" "Cornwall 007E" "Conwy 015F" "Bridgend 014B" ...
##  $ GRP_CD   : num [1:41729] 5 9 5 7 9 9 9 9 5 6 ...
##  $ GRP_LABEL: chr [1:41729] "e-Rational Utilitarians" "Settled Offline Communities" "e-Rational Utilitarians" "Passive and Uncommitted Users" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   SHP_ID = col_double(),
##   ..   LSOA11_CD = col_character(),
##   ..   LSOA11_NM = col_character(),
##   ..   GRP_CD = col_double(),
##   ..   GRP_LABEL = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
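
As a further quick check, the group labels can be tabulated and the LSOA codes tested for duplicates, since these codes serve as the join key later on. The sketch below is only an illustration: it uses a small stand-in table (`iuc_toy`, a hypothetical name) built from the ten rows printed above rather than the full file. With the real data you would pass `iuc` instead.

```r
library(dplyr)

# stand-in for the first ten rows of the IUC table printed above
iuc_toy <- tibble(
  LSOA11_CD = c("E01020179", "E01033289", "W01000189", "W01001022", "W01000532",
                "E01018888", "E01018766", "E01019948", "W01000539", "E01019171"),
  GRP_CD = c(5, 9, 5, 7, 9, 9, 9, 9, 5, 6),
  GRP_LABEL = c("e-Rational Utilitarians", "Settled Offline Communities",
                "e-Rational Utilitarians", "Passive and Uncommitted Users",
                "Settled Offline Communities", "Settled Offline Communities",
                "Settled Offline Communities", "Settled Offline Communities",
                "e-Rational Utilitarians", "e-Mainstream")
)

# number of areas per group, largest group first
group_counts <- count(iuc_toy, GRP_CD, GRP_LABEL, sort = TRUE)
group_counts

# index of the first repeated code, or 0 if every code occurs only once
first_duplicate <- anyDuplicated(iuc_toy$LSOA11_CD)
first_duplicate
```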

Now that the data are loaded, we can move on to acquiring our spatial data. As the IUC is created at the level of the Lower Layer Super Output Area (LSOA) Census geography, we need to download the corresponding administrative boundaries. As the data set for the entire country is quite large, we will focus on Liverpool.

  1. Go to the UK Data Service Census support portal and select Boundary Data Selector.
  2. Set Country to England, Geography to Statistical Building Block, dates to 2011 and later, and click Find.
  3. Select English Lower Layer Super Output Areas, 2011 and click List Areas.
  4. Select Liverpool from the list and click Extract Boundary Data.
  5. Wait until loaded and download the BoundaryData.zip file.
  6. Unzip and save the file in the usual fashion.

Note
You could also have downloaded the shapefile with the data already joined to the LSOA boundaries directly from the CDRC data portal, but this is the national data set and is quite large (75MB). As we will only be looking at Liverpool today, we do not need all LSOAs in Great Britain. Of course, you could have used this file and kept only those LSOAs that fall within the boundaries of Liverpool.
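
The filtering mentioned in this note could look roughly like the sketch below. It works on a small stand-in table (`lsoa_toy`, a hypothetical name) because the attribute names of the national shapefile are not shown here. It also assumes that each LSOA name starts with the name of its local authority followed by a number, as is the case in the LSOA11_NM column of the csv file.

```r
library(dplyr)

# stand-in rows; codes and names are taken from the outputs in this section
lsoa_toy <- tibble(
  LSOA11_CD = c("E01033755", "E01033758", "E01020179", "W01000189"),
  LSOA11_NM = c("Liverpool 062D", "Liverpool 060B", "South Hams 012C", "Conwy 015F")
)

# keep only rows whose name is "Liverpool" followed by a space and a digit
liverpool_only <- filter(lsoa_toy, grepl("^Liverpool [0-9]", LSOA11_NM))
liverpool_only
```

The same `filter()` call should also work on an sf object, as sf provides dplyr methods, although that is not tested here.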

Now that we have the administrative boundary data, we can prepare the IUC map by joining our csv file with the IUC classification to the shapefile.

# load libraries
library(sf)
library(tmap)

# load spatial data
liverpool <- st_read("data/boundaries/england_lsoa_2011.shp")
## Reading layer `england_lsoa_2011' from data source 
##   `/Users/justinvandijk/Dropbox/UCL/Web/jtvandijk.github.io/GEOG0114/data/boundaries/england_lsoa_2011.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 298 features and 3 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 332390.2 ymin: 379748.5 xmax: 345636 ymax: 397980.1
## Projected CRS: OSGB36 / British National Grid
# inspect
plot(liverpool$geometry)

# join data
liv_iuc <- left_join(liverpool, iuc, by = c(code = "LSOA11_CD"))

# inspect
liv_iuc
## Simple feature collection with 298 features and 7 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 332390.2 ymin: 379748.5 xmax: 345636 ymax: 397980.1
## Projected CRS: OSGB36 / British National Grid
## First 10 features:
##                          label           name      code SHP_ID      LSOA11_NM
## 1  E08000012E02006934E01033755 Liverpool 062D E01033755  25097 Liverpool 062D
## 2  E08000012E02006932E01033758 Liverpool 060B E01033758  24070 Liverpool 060B
## 3  E08000012E02001356E01033759 Liverpool 010F E01033759  26845 Liverpool 010F
## 4  E08000012E02006932E01033762 Liverpool 060E E01033762  26866 Liverpool 060E
## 5  E08000012E02001396E01032505 Liverpool 050F E01032505  27848 Liverpool 050F
## 6  E08000012E02001396E01032506 Liverpool 050G E01032506   2429 Liverpool 050G
## 7  E08000012E02001396E01032507 Liverpool 050H E01032507  24242 Liverpool 050H
## 8  E08000012E02001373E01032508 Liverpool 027G E01032508  28413 Liverpool 027G
## 9  E08000012E02001373E01032509 Liverpool 027H E01032509  24339 Liverpool 027H
## 10 E08000012E02001354E01032510 Liverpool 008F E01032510  25167 Liverpool 008F
##    GRP_CD                     GRP_LABEL                       geometry
## 1       2               e-Professionals POLYGON ((334276.7 391012.8...
## 2       4         Youthful Urban Fringe POLYGON ((335723 391178, 33...
## 3       7 Passive and Uncommitted Users POLYGON ((338925 394476, 33...
## 4       1           e-Cultural Creators POLYGON ((334612.4 391111.7...
## 5       7 Passive and Uncommitted Users POLYGON ((335894.7 387448.3...
## 6       6                  e-Mainstream POLYGON ((336256.7 387691.8...
## 7       3                    e-Veterans POLYGON ((336803.5 387432.7...
## 8      10                   e-Withdrawn POLYGON ((339299 391470, 33...
## 9       7 Passive and Uncommitted Users POLYGON ((338901 391308, 33...
## 10      7 Passive and Uncommitted Users POLYGON ((338018.2 395716.4...
# inspect
tmap_mode("view")
tm_shape(liv_iuc) + tm_fill(col = "GRP_LABEL") + tm_layout(legend.outside = TRUE)
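
After a join like the one above, it can be worth confirming that every boundary found a matching row in the IUC table, because a `left_join()` silently fills unmatched rows with `NA`. The following is a minimal sketch on toy data frames (`boundaries_toy` and `iuc_toy` are hypothetical names, with one code deliberately left out of the IUC table). With the real data you would inspect `liv_iuc` in the same way.

```r
library(dplyr)

# toy stand-ins: three boundary codes, of which only two appear in the IUC table
boundaries_toy <- tibble(code = c("E01033755", "E01033758", "E01033759"))
iuc_toy <- tibble(
  LSOA11_CD = c("E01033755", "E01033758"),
  GRP_LABEL = c("e-Professionals", "Youthful Urban Fringe")
)

# same join pattern as used for liv_iuc
joined <- left_join(boundaries_toy, iuc_toy, by = c("code" = "LSOA11_CD"))

# rows without a match carry NA in the joined columns
n_unmatched <- sum(is.na(joined$GRP_LABEL))
n_unmatched

# anti_join() lists the boundary codes that found no partner
unmatched <- anti_join(boundaries_toy, iuc_toy, by = c("code" = "LSOA11_CD"))
unmatched
```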

Let’s use the same colours as used on CDRC mapmaker by specifying the hex colour codes for each of our groups. Note the order of the colours is important: the colour for group 1 is first, group 2 second and so on.

# define palette
iuc_colours <- c("#dd7cdc", "#ea4d78", "#d2d1ab", "#f36d5a", "#a5cfbc", "#e4a5d0",
    "#8470ff", "#79cdcd", "#808fee", "#ffd39b")

# plot pretty
tm_shape(liv_iuc) + tm_fill(col = "GRP_LABEL", palette = iuc_colours) + tm_layout(legend.outside = TRUE)
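
One caveat: GRP_LABEL is stored as text, and text categories are typically sorted alphabetically when mapped, which need not match the group numbering. It may therefore be safer to turn the labels into a factor ordered by GRP_CD and to pick the colours by group code. The sketch below illustrates the idea on a few rows taken from the joined output above. The names `toy`, `lookup` and `pal` are hypothetical, and whether your map needs this step depends on how your version of tmap orders categories.

```r
# palette as defined above: position i holds the colour for group i
iuc_colours <- c("#dd7cdc", "#ea4d78", "#d2d1ab", "#f36d5a", "#a5cfbc", "#e4a5d0",
                 "#8470ff", "#79cdcd", "#808fee", "#ffd39b")

# toy rows: group codes and labels as they appear in the joined output
toy <- data.frame(
  GRP_CD = c(2, 4, 7, 1, 7, 6, 3, 10),
  GRP_LABEL = c("e-Professionals", "Youthful Urban Fringe",
                "Passive and Uncommitted Users", "e-Cultural Creators",
                "Passive and Uncommitted Users", "e-Mainstream",
                "e-Veterans", "e-Withdrawn")
)

# one row per group that is present, sorted by group code
lookup <- unique(toy[order(toy$GRP_CD), c("GRP_CD", "GRP_LABEL")])

# order the factor levels by group code instead of alphabetically
toy$GRP_LABEL <- factor(toy$GRP_LABEL, levels = lookup$GRP_LABEL)
levels(toy$GRP_LABEL)

# select colours by group code, so groups missing from the data do not shift the palette
pal <- iuc_colours[lookup$GRP_CD]
pal
```

With the real data, the same steps applied to `liv_iuc` would give a factor column and a matching `pal` vector that could be passed to the `palette` argument.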

1.2.2 Tutorial task I

Now that we have these cluster classifications, how can we link them to people? Try using the Mid-Year Population Estimates 2019 that you can download below to:

  • calculate the total number of people associated with each cluster group for England and Wales as a whole (not just Liverpool!); and
  • create a pretty data visualisation showing the results (no map!).

File download

File | Type | Link
LSOA-level Mid-Year Population Estimates England and Wales 2019 | csv | Download
Lower-layer Super Output Areas Great Britain 2011 | shp | Download

1.2.3 k-means clustering

In several cases, including the 2011 residential-based area classifications and the Internet User Classification, a technique called k-means clustering is used to create a geodemographic classification. K-means clustering partitions a set of observations into a chosen number of clusters (k), with each observation assigned to the cluster with the nearest mean. A cluster is therefore a collection of data points grouped together because they are similar in terms of the input variables (i.e. the standardised scores of your input data). To run a k-means clustering, you first define the target number of clusters k. The algorithm then assigns every observation to one of the clusters by iteratively searching for the solution that minimises the total within-cluster variance.

For the second part of this week’s practical material, we will be replicating part of the Internet User Classification for Great Britain. For this we will be using an MSOA-level input data set containing various socio-demographic and socio-economic variables that you can download below together with the MSOA administrative boundaries.
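
To make the k-means procedure described above more concrete, here is a minimal sketch using the `kmeans()` function from base R on synthetic data. The two variables (`share_young`, `avg_speed`) and their values are invented purely for illustration and are not part of the IUC input data. The choices of `centers = 2` and `nstart = 25` are assumptions made for this toy example, not the settings used to build the IUC.

```r
set.seed(42)

# synthetic stand-in for area-level input variables: two well-separated groups
toy <- data.frame(
  share_young = c(rnorm(50, mean = 0.30, sd = 0.03), rnorm(50, mean = 0.10, sd = 0.03)),
  avg_speed   = c(rnorm(50, mean = 60, sd = 5), rnorm(50, mean = 20, sd = 5))
)

# standardise to z-scores so that both variables contribute on the same scale
toy_z <- scale(toy)

# run k-means with k = 2; nstart repeats the algorithm from several random
# starting points and keeps the solution with the lowest within-cluster variance
km <- kmeans(toy_z, centers = 2, nstart = 25)

# cluster sizes, cluster centres (in z-scores) and the minimised criterion
table(km$cluster)
km$centers
km$tot.withinss
```

Standardising first matters because k-means relies on distances: without it, a variable measured on a large scale (such as a download speed) would dominate a variable measured as a proportion.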

The data set contains the following variables:

Variable | Definition
msoa11cd | MSOA Code
age_total, age0to4pc, age5to14pc, age16to24pc, age25to44pc, age45to64pc, age75pluspc | Percentage of people in various age groups
nssec_total, 1_higher_managerial, 2_lower_managerial, 3_intermediate_occupations, 4_employers_small_org, 5_lower_supervisory, 6_semi_routine, 7_routine, 8_unemployed | Percentage of people in selected operational categories and sub-categories drawn from the National Statistics Socio-economic Classification (NS-SEC)
avg_dwn_speed, avb_superfast, no_decent_bband, bband_speed_under2mbs, bband_speed_under10mbs, bband_speed_over30mbs | Measures of broadband use and internet availability

File download

File | Type | Link
MSOA-level input variables for IUC | csv | Download
Middle-layer Super Output Areas Great Britain 2011 | shp | Download
# load data
iuc_input <- read_csv("data/index/msoa_iuc_input.csv")

# inspect
head(iuc_input)
## # A tibble: 6 × 23
##   msoa11cd  age_total age0to4pc age5to…¹ age16…² age25…³ age45…⁴ age75…⁵ nssec…⁶
##   <chr>         <dbl>     <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 E02000001      7375    0.032    0.0388  0.0961   0.407   0.273  0.0607    5816
## 2 E02000002      6775    0.0927   0.122   0.113    0.280   0.186  0.0980    3926
## 3 E02000003     10045    0.0829   0.102   0.118    0.306   0.225  0.0646    6483
## 4 E02000004      6182    0.0590   0.102   0.139    0.254   0.250  0.0886    4041
## 5 E02000005      8562    0.0930   0.119   0.119    0.299   0.214  0.0501    5368
## 6 E02000007      8791    0.103    0.125   0.129    0.285   0.197  0.0688    5158
## # … with 14 more variables: `1_higher_managerial` <dbl>,
## #   `2_lower_managerial` <dbl>, `3_intermediate_occupations` <dbl>,
## #   `4_employers_small_org` <dbl>, `5_lower_supervisory` <dbl>,
## #   `6_semi_routine` <dbl>, `7_routine` <dbl>, `8_unemployed` <dbl>,
## #   avg_dwn_speed <dbl>, avb_superfast <dbl>, no_decent_bband <dbl>,
## #   bband_speed_under2mbs <dbl>, bband_speed_under10mbs <dbl>,
## #   bband_speed_over30mbs <dbl>, and abbreviated variable names ¹​age5to14pc, …